ABSTRACT


A massive amount of information is hidden in the biomedical literature in the form of scientific publications, book chapters,
conference reports, etc. This information is growing exponentially, at a speed exceeding Moore’s Law, i.e. a doubling every
two years. It is therefore not possible for researchers and practitioners to manually extract and relate information from
different written resources. Moreover, the data present in the written resources are unstructured, i.e. free text, so obtaining
annotated material from this literature is arduous and expensive. To overcome these problems, Natural Language Processing (NLP)
and Unsupervised Learning approaches are used. Natural Language Processing is a part of text mining, which is the discovery by
computer of new, previously unknown information by automatically extracting and relating information from different written
resources to reveal otherwise ‘hidden’ meanings. Unsupervised Learning is the part of machine learning in which no annotated
training data are necessary and the focus is on exploring the data to find insights. Both approaches can be used to extract
knowledge from textual data in the form of different interactions such as protein-protein, gene-gene, gene-protein, etc. These
approaches can also be used to develop classifiers, databases, tools or software that would in future automatically extract
knowledge from the literature, answer questions arising in biomedical research and help in the development of new hypotheses.
Here we discuss 53 software packages, tools and databases developed using Natural Language Processing (NLP) and unsupervised
learning approaches, which analyze and process plain text, and we categorize current work in biomedical information and entity
extraction.
INTRODUCTION

Textual data is considered the building block upon which
any research thrives. The level of detail and the volume of
data made available through advances in technology and the
internet have increased tremendously. The exponential
growth of research in the biomedical sciences has led to an
increase in its publications. The textual data in the
published literature is unstructured, i.e. free text.
Unstructured data either does not have a predefined
information model or is not organized in a predefined way.
The information is commonly text-heavy, but may also contain
critical information in the form of dates, numbers, and
facts such as protein-protein interactions, gene-disease
associations, etc. Because the data both reported and hidden
in the biomedical literature are growing exponentially and
the written content is unstructured, it is not feasible for
researchers and practitioners to manually extract and relate
information from different written sources (1) (2). Manual
effort to transform unstructured text into structured form
is a laborious process, and automatic techniques are the
solution (3). There are various automatic techniques for
solving the above-mentioned issue, viz. supervised and
unsupervised machine learning, text mining, semantic
analysis, artificial intelligence, etc. In the current work
we discuss the importance of two automatic techniques, i.e.
unsupervised machine learning and natural language
processing, and the software, tools and databases developed
using these techniques, so that these techniques can be
implemented on any biomedical corpus. Natural language
processing (NLP) is the ability of a computer program to
comprehend human language as it is spoken. Progress in NLP
applications is challenging because computers commonly
require humans to “speak” to them in a programming language
that is precise, unambiguous and highly structured, or
through a limited number of clearly articulated voice
commands. Most of the research being done on NLP revolves
around search, i.e. keyword search or searching for
relationship entities. NLP enables users to query data sets
in the form of a question that they might pose to another
person. The machine interprets the critical components of
the human-language sentence, such as those that might
correspond to specific features in a data set, and returns
an answer. NLP can be utilized to interpret free text and
make it analyzable (51).
Unsupervised machine learning approaches are generally
beneficial for unstructured data, i.e. data where no labels
are given to the learning algorithm, leaving it on its own
to find structure in its input. This kind of learning can be
a goal in itself, by finding hidden patterns in data, or a
means towards an end, i.e. feature learning used for the
development of textual classifiers. Unsupervised learning
problems include clustering, in which the goal is to find
the inherent groupings in the data (52). Many researchers
have utilized these approaches for information extraction
from biomedical literature, especially for the discovery of
protein-protein interactions, gene-protein interactions,
gene-drug interactions, etc. In this paper we discuss a few
software packages, databases and techniques which use
Natural Language Processing (NLP) and unsupervised learning
approaches for classification and entity recognition from
biofilm literature.

Brief Description of Techniques 

This section presents a brief discussion of the Natural
Language Processing (NLP) and unsupervised learning
techniques and their general methods of linguistic analysis
for finding different interactions (4).

Natural Language Processing (NLP) methods. Knowledge
discovery from unstructured text utilizes techniques from
computational linguistics, such as syntactic or semantic
parsing, to analyze sentence structures. Methods in this
category define grammars to describe sentence structures and
utilize parsers to extract syntactic information and
internal dependencies within individual sentences.
Approaches in this category can be applied to different
knowledge domains after being carefully tuned to the
specific problems. However, there is still no guarantee that
performance in the field of biomedicine will be comparable
after tuning. Until recently, methods based on computational
linguistics still could not generate satisfactory results
(5) (6).
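
The grammar-and-parser idea described above can be illustrated with a deliberately small sketch. It is not taken from any of the reviewed tools: the lexicon, the single grammar rule (PROTEIN VERB [PREPOSITION] PROTEIN) and the function names are hypothetical, and a real system would rely on a full syntactic parser and curated dictionaries.

```python
import re

# Hypothetical mini-lexicon (an assumption for illustration only).
LEXICON = {
    "PROT": {"BRCA2", "RAD51", "TP53", "MDM2"},
    "VERB": {"binds", "inhibits", "activates"},
    "PREP": {"to", "with"},
}


def tag_of(token):
    """Return the lexical category of a token, or 'O' for anything else."""
    for label, words in LEXICON.items():
        if token in words:
            return label
    return "O"


def parse_relations(sentence):
    """Apply the toy rule PROT VERB [PREP] PROT to one sentence."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    tagged = [(tok, tag_of(tok)) for tok in tokens]
    relations = []
    for i, (tok, tag) in enumerate(tagged):
        if tag != "PROT":
            continue
        j = i + 1  # the verb must directly follow the subject
        if j < len(tagged) and tagged[j][1] == "VERB":
            k = j + 1
            if k < len(tagged) and tagged[k][1] == "PREP":
                k += 1  # optional preposition
            if k < len(tagged) and tagged[k][1] == "PROT":
                relations.append((tok, tagged[j][0], tagged[k][0]))
    return relations


print(parse_relations("BRCA2 binds to RAD51 and MDM2 inhibits TP53."))
```

Even this toy rule shows why such methods need careful tuning: any sentence that phrases the interaction differently (passive voice, intervening clauses) is missed unless a new rule is written for it.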

Unsupervised Machine learning. Machine learning refers to
the ability of a machine to learn from experience in order
to extract knowledge from data corpora. As opposed to the
aforementioned technique, which needs laborious effort to
define a set of rules or grammars, machine learning
techniques are able to extract protein-protein interaction
patterns without human intervention. Statistical approaches
are based on word occurrences in a large text corpus.
Significant features or patterns are detected and used to
classify the abstracts or sentences containing
protein-protein or gene-protein interactions and to
characterize the corresponding relations among genes or
proteins. Rule-based approaches, in turn, define a set of
rules for possible textual relationships, called patterns,
which encode similar structures in expressing relationships.
When combined with statistical methods, scoring schemes
based on the occurrences of patterns are normally used to
describe the confidence of the relationship. Similar to
computational linguistics methods, rule-based approaches can
make use of syntactic information to achieve better
performance, although they can also work without prior
parsing and tagging of the text (7) (8).
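
As a rough illustration of how word occurrences and pattern occurrences can be combined into a confidence score, the following minimal sketch counts sentence-level co-occurrences of protein names and reports the fraction of those sentences that also contain a relation keyword. The protein dictionary, the keyword list and the scoring formula are assumptions made for this example, not the scheme of any specific tool.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical dictionaries (assumptions for illustration only).
PROTEINS = {"BRCA2", "RAD51", "TP53", "MDM2"}
PATTERN_WORDS = {"binds", "interacts", "inhibits"}


def score_pairs(sentences):
    """Map each protein pair to (co-occurrence count, pattern confidence).

    Confidence is the share of co-occurring sentences that also contain
    a relation keyword; it needs no annotated training data.
    """
    cooccurrence = Counter()
    pattern_hits = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        found = sorted(tokens & PROTEINS)
        has_pattern = bool(tokens & PATTERN_WORDS)
        for pair in combinations(found, 2):
            cooccurrence[pair] += 1
            if has_pattern:
                pattern_hits[pair] += 1
    return {pair: (n, pattern_hits[pair] / n) for pair, n in cooccurrence.items()}


corpus = [
    "BRCA2 binds RAD51 during repair.",
    "BRCA2 and RAD51 were both measured.",
    "MDM2 inhibits TP53.",
]
for pair, (count, confidence) in score_pairs(corpus).items():
    print(pair, count, confidence)
```

Pairs that co-occur often but rarely alongside a relation keyword receive a low confidence, which is one simple way to separate incidental mentions from stated interactions.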

Figure 1 shows the general outline of an information
extraction system for biomedical literature. Data are
collected from various sources such as published articles,
scientific journals, books and technical reports, and the
collected data are in an unstructured format. Then, using
automatic techniques such as text mining, text units (i.e.
words, sentences and paragraphs) containing relevant
information are generated, which need to be analyzed to
obtain knowledge. These text units are further processed and
analyzed using unsupervised learning and natural language
processing, which are used for text classification or
clustering on certain textual features and for entity
recognition, e.g. of gene-protein, protein-protein,
gene-disease and gene-drug interactions. The gathered
information can be used for the development of databases,
classifiers, software, tools or pipelines for future use.
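
The flow from raw text to text units to unsupervised grouping can be sketched as follows. This is a minimal illustration under stated assumptions: the sentence splitter, the stop-word list, the similarity threshold and the single-pass clustering strategy are all choices made for the example and do not reproduce the system in Figure 1.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is", "with", "was", "by"}


def split_sentences(text):
    """Split free text into sentence-level text units."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def vectorize(sentence):
    """Turn a sentence into a bag-of-words term-frequency vector."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(sentences, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []
    for sentence in sentences:
        vec = vectorize(sentence)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [sentence]})
        else:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(sentence)
    return [c["members"] for c in clusters]


text = ("BRCA2 binds RAD51. RAD51 binds BRCA2 in repair. "
        "Aspirin reduces fever. Fever is reduced by aspirin.")
for group in cluster(split_sentences(text)):
    print(group)
```

No labels are supplied at any point; the groups emerge only from the overlap in vocabulary, which is the sense in which the approach is unsupervised.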

DISCUSSION 

The software, tools, databases and pipelines which are
involved in information extraction in the form of
relationship entities in biofilm literature using Natural
Language Processing and Unsupervised Learning approaches are
shown in Table 1.

The above-mentioned software, tools, databases and pipelines
can be used for information extraction by first identifying
an item or concept in a textual resource and then detecting
links between the concepts obtained from the text. Linking
the concepts together gives them additional context, which
results in valuable knowledge that can be used for
downstream analyses such as genome and gene expression
annotation, drug-target discovery, drug repositioning,
protein-protein interaction analysis, construction of
ontologies, etc. (9). These techniques also help researchers
in formulating hypotheses for their future studies, as they
can find new concepts while analyzing text.
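
The two steps described above (identify concepts, then link them) can be sketched as a small concept graph. The concept names below are placeholders rather than real entities, and the indirect-link query is only one assumed way in which linked concepts might suggest a hypothesis; it is not the method of any tool in Table 1.

```python
import re
from collections import defaultdict
from itertools import combinations

# Placeholder concept dictionary (hypothetical names, not real entities).
CONCEPTS = {"geneA", "proteinB", "diseaseC", "drugD"}


def build_links(sentences):
    """Step 1: find known concepts per sentence. Step 2: link co-mentioned ones."""
    graph = defaultdict(set)
    for sentence in sentences:
        found = sorted(set(re.findall(r"\w+", sentence)) & CONCEPTS)
        for a, b in combinations(found, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def indirect_links(graph, concept):
    """Concepts reachable through one intermediate but never co-mentioned directly."""
    direct = graph.get(concept, set())
    candidates = set()
    for neighbour in direct:
        candidates |= graph[neighbour]
    return sorted(candidates - direct - {concept})


sentences = [
    "geneA regulates proteinB.",
    "proteinB is elevated in diseaseC.",
    "drugD lowers proteinB.",
]
graph = build_links(sentences)
print(indirect_links(graph, "geneA"))
```

Here geneA is never mentioned together with diseaseC or drugD, yet both are connected to it through proteinB; such indirect connections are candidates for further study rather than established facts.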

CONCLUSION 

The importance of natural language processing and
unsupervised learning lies in the fact that they not only
extract information hidden in biomedical textual data but
can also be used for the development of new servers,
software, databases, etc. These approaches can be used on
any biomedical literature. If the above-mentioned tools meet
all the challenges faced in the analysis of textual data,
such as specific ontologies describing a single disease at
various levels, individual pathways and genes for particular
diseases, appropriate gene-disease interactions, and the
ability of a tool to distinguish false negative results,
they will continue to be an indispensable asset for
researchers in the biomedical domain (9).

Figure 1: A general architecture of an information extraction system for biomedical literature, which could be further used
for the development of databases or pipelines.

Table 1: Overview of Natural Language Processing and Unsupervised Learning applications for the biomedical (e.g. biofilm)
domain. The table lists the name of each tool/software/database/pipeline, a brief description and its URL. The approach used
for the development is given in brackets, where NLP stands for Natural Language Processing and UL stands for Unsupervised
Learning.



| Name (approach) | Description | URL |
| --- | --- | --- |
| CBioC (NLP) | Collaborative Bio Curation uses automatic text extraction as a starting point to initialize the interaction database. After that, researchers in the biomedical domain contribute to the curation process by subsequent edits (10). | cbioc.eas.asu.edu |
| Chilibot (NLP) | This software searches the MEDLINE literature database to rapidly identify relationships between genes, proteins, or any keywords that the user might be interested in (11). | www.chilibot.net |
| GoPubMed (NLP) | This is a search engine that allows users to explore PubMed search results with the Gene Ontology (GO), a hierarchically structured vocabulary for molecular biology (12). | www.gopubmed.org |
| iHOP (NLP) | Information Hyperlinked over Proteins constructs a gene network by converting the information in MEDLINE into one navigable resource using genes and proteins as hyperlinks between sentences and abstracts (13). | www.ihop-net.org/UniPub/iHOP |
| iProLINK (NLP and UL) | This is a resource to facilitate text mining in the area of literature-based database curation, named entity recognition, and protein ontology development. It can be utilized by computational and biomedical researchers to explore the literature information on proteins and their features or properties (4). | pir.georgetown.edu/iprolink |
| PreBIND (NLP and UL) | This tool helps researchers locate bio-molecular interaction information in the scientific literature. It identifies papers describing interactions using a support vector machine (4). | prebind.bind.ca |
| PubGene (NLP) | This is constructed to identify the relationships between genes and proteins, diseases, cell processes, and so on based on their co-occurrences in the abstracts of scientific papers, their sequence homology, and the statistical probability of their co-occurrences (4). | www.pubgene.org |
| Whatizit (NLP) | It is a text processing tool that can identify molecular biology terms and link them to publicly available databases. Identified terms are enveloped with XML tags that carry additional data, such as the primary keys to the databases where all the applicable information is kept. It is also a MEDLINE abstracts search engine (4). | www.ebi.ac.uk/webservices/whatizit/info.jsf |
| askMEDLINE (NLP) | Free-text, natural language (English only) query for MEDLINE/PubMed (14). | askmedline.nlm.nih.gov/ask/ask.php |
| XplorMed (NLP) | The system provides the main associations between the words in groups of abstracts (9). | xplormed.ogic.ca/ |
| eTBLAST (NLP) | A text-similarity based search engine, using all words in a paragraph to match similar documents (15). | etest.vbi.vt.edu/etblast3/ |
| Medline Ranker (NLP) | Ranks MEDLINE abstracts based on user defined queries (16). | cbdm.mdc-berlin.de/~medlineranker/cms/medline-ranker |
| MiSearch (NLP) | Ranks retrieved articles from PubMed based on a customized personal profile (17). | portal.ncibi.org/gateway/misearch.html |
| PICO (NLP and UL) | Search engine for MEDLINE/PubMed with an integrated spelling checker (18). | pubmedhh.nlm.nih.gov/nlmd/pico/piconew.php |
| PubCrawler (NLP) | Free alerting service that scans daily updates of the NCBI MEDLINE/PubMed and GenBank databases (19). | pubcrawler.gen.tcd.ie/ |
Table 1: (Continued)

| Name (approach) | Description | URL |
| --- | --- | --- |
| PubFocus (NLP) | Statistical analysis of the MEDLINE/PubMed search queries enriched with additional information from journal rank database and forward referencing database (20). | www.pubfocus.com/ |
| PubGet (NLP) | Retrieves PDFs directly based on a user defined query in PubMed (9). | pubget.com/ |
| PubMatrix (NLP and UL) | Allows simple text based mining of the NCBI literature search service PubMed using any two lists of keyword terms, resulting in a frequency matrix of term co-occurrence (21). | pubmatrix.grc.nia.nih.gov/secure-bin/index.pl |
| PubNet (NLP) | A flexible system for visualizing literature-derived networks (22). | pubnet.gersteinlab.org/ |
| GeneValorization (NLP) | GeneValorization gives a very clear and handy overview of the bibliography corresponding to user uploaded gene lists (23). | bioguide-project.net/gv/start_geneval.php |
| DPWP (NLP) | Disease/phenotype PAGE is a disease focused gene set analysis web tool to analyze microarray gene expression data with predefined groups of disease related genes (24). | dpwebpage.nia.nih.gov/ |
| Anne O’Tate (NLP and UL) | An overview of the set of articles retrieved by a PubMed query is generated. | arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/AnneOTate.cgi |
| ProteinCorral (NLP and UL) | Combines information retrieval and extraction from MEDLINE (25). | www.ebi.ac.uk/Rebholz-srv/pcorral/ |
| MEDIE (NLP and UL) | Retrieves biomedical correlations from MEDLINE, based on indexing by natural language processing and text mining techniques (9). | www.nactem.ac.uk/medie/ |
| PubReMiner (NLP and UL) | Breaks down the results of a PubMed query into categories (26). | hgserver2.amc.nl/cgi-bin/miner/miner2.cgi |
| PubViz (NLP) | An interactive MEDLINE search engine utilizing external knowledge (27). | brainarray.mbni.med.umich.edu/ |
| Quertle (NLP) | Using an amalgamation of linguistic systems, Quertle finds facts defined within documents, creating its own database of about 300 million relationships, and is able to report the ones that are relevant to your query. | www.quertle.info/ |
| CoPub (NLP) | Web application with gene focussed retrieval of co-occurring keywords from MEDLINE (28). | www.copub.org |
| COREMINE Medical (NLP) | Presents results about health, medicine and biology in a dashboard format comprised of panels containing various categories of information ranging from introductory sources to the latest scientific articles (9). | www.coremine.com/ |
| FACTA+ (NLP and UL) | Finds the concepts associated with a user query by text analysis (29). | www.nactem.ac.uk/facta/ |
| PPInterFinder (NLP) | PPInterFinder uses relation keyword co-occurrences with protein names to extract information on protein-protein interactions from MEDLINE abstracts (30). | biomining-bu.in/ppinterfinder/html/action.pl |
| Reflect (NLP) | Reflect highlights protein and small molecule names, such as IL-5 and rapamycin, in text. | reflect.embl.de |
| AliBaba (NLP) | Parses PubMed abstracts for biological objects and their relations (31). | alibaba.informatik.hu-berlin.de/ |
| Martini (NLP) | Martini uses literature keywords to compare gene sets (32). | martini.embl.de |
| STRING (NLP) | STRING is a database of known and predicted protein interactions (33). | string-db.org/ |
Table 1: (Continued)

| Name (approach) | Description | URL |
| --- | --- | --- |
| DiseaseConnect (NLP) | A comprehensive web server for mechanism-based disease-disease connections (34). | http://disease-connect.org |
| DISEASES (NLP) | Text mining and data integration of disease-gene associations (35). | http://diseases.jensenlab.org/ |
| NOBLE (NLP) | Flexible concept recognition for large-scale biomedical natural language processing (36). | https://omictools.com/noble-coder-tool |
| Semantator (NLP) | Semantic annotator for converting biomedical text to linked data (37). | https://sbmi.uth.edu/ontology/project/semantator.htm |
| BioPortal (NLP) | An open repository of biomedical ontologies that provides access via Web services and Web browsers to ontologies developed in OWL, RDF, OBO format and Protege frames (38). | http://bioportal.bioontology.org |
| SimplevisGrid (NLP) | Grid services for visualization of diverse biomedical knowledge and molecular systems data (39). | |
| miRiaD (NLP) | A Text Mining Tool for Detecting Associations of microRNAs with Diseases (40). | http://biotm.cis.udel.edu/miRiaD |
| BRONCO (NLP) | Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations (41). | http://infos.korea.ac.kr/bronco |
| MicrO (NLP) | An ontology of phenotypic and metabolic characters, assays, and culture media found in prokaryotic taxonomic descriptions (42). | http://purl.obolibrary.org/obo/MicrO.owl; https://github.com/carrineblank/MicrO; http://www.obofoundry.org/ontology/micro.html |
| DiMeX (NLP) | A Text Mining System for Mutation-Disease Association Extraction (43). | http://biotm.cis.udel.edu/dimex/ |
| SimConcept (NLP and UL) | A Hybrid Approach for Simplifying Composite Named Entities in Biomedicine (44). | https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/simconcept/ |
| TagLine (NLP and UL) | Information Extraction for Semi-Structured Text in Medical Progress Notes (45). | |
| MetaBioME (NLP) | A database to explore commercially useful enzymes in metagenomic datasets (46). | http://metasystems.riken.jp/metabiome/ |
| BIOADI (NLP and UL) | A machine learning approach to identifying abbreviations and definitions in biological literature (47). | http://bioagent.iis.sinica.edu.tw/BIOADI/ |
| PepBank (NLP) | A database of peptides based on sequence text mining and public peptide data sources (48). | http://pepbank.mgh.harvard.edu/ |
| PIE (NLP and UL) | An online prediction system for protein-protein interactions from text (49). | http://bi.snu.ac.kr/pie/ |
| GNAT (NLP) | Inter-species normalization of gene mentions (50). | http://cbioc.eas.asu.edu/gnat/; http://bcms.bioinfo.cnio.es |